Loading required package: airports
Loading required package: cherryblossom
Loading required package: usdata
Attaching package: 'openintro'
The following object is masked from 'package:gt':
sp500
library(ggrepel)library(patchwork)library(ggh4x)library(jpeg)library(ggpubr)# set theme for ggplot2ggplot2::theme_set(ggplot2::theme_minimal(base_size =11))# set width of code outputoptions(width =65)# set figure parameters for knitrknitr::opts_chunk$set(fig.width =7, # 7" widthfig.asp =0.618, # the golden ratiofig.retina =3, # dpi multiplier for displaying HTML output on retinafig.align ="center", # center align figuresdpi =300# higher dpi, sharper image)if (!require("pacman")) install.packages("pacman")
Loading required package: pacman
# use this line for installing/loadingpacman::p_load(tidyverse, glue, scales, ggthemes, ggh4x) devtools::install_github("tidyverse/dsbox")
WARNING: Rtools is required to build R packages, but is not currently installed.
Please download and install Rtools 4.5 from https://cran.r-project.org/bin/windows/Rtools/.
Using GitHub PAT from the git credential store.
Skipping install of 'dsbox' from a github remote, the SHA1 (244ecdfe) has not changed since last install.
Use `force = TRUE` to force installation
1 - Du Bois challenge.
# Read in data from income fileincome <-read_csv(here("data" ,"income.csv"))
Rows: 7 Columns: 7
── Column specification ─────────────────────────────────────────
Delimiter: ","
chr (1): Class
dbl (6): Average_Income, Rent, Food, Clothes, Tax, Other
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning: Removed 2 rows containing missing values or values outside the
scale range (`geom_text()`).
Sources
Hex color values: https://htmlcolorcodes.com/colors/
Referenced for percent function parameters: https://scales.r-lib.org/reference/percent_format.html
Referenced for ifelse usage: https://stat.ethz.ch/R-manual/R-devel/library/base/html/ifelse.html
2 - COVID survey - interpret
Example 1:
For the response variable “Getting the vaccine will make me feel safer at work”, nurses had an average response close to 1, with a relatively narrow confidence interval. This indicates that most nurses agreed with the statement, and the estimate is fairly precise. Medical professionals, on the other hand, had a mean closer to 2 with a wider confidence interval. This suggests that while medical professionals also tended to agree, they were on average slightly less certain than nurses. Additionally, the wider confidence interval for medical professionals indicates greater variability in their responses and less precision in the estimate compared to the nurses. The responses did surprise me. I initially thought that medical professionals might be more confident than nurses because they typically have more training. However, on second thought, the confidence interval for the medical professionals group is much wider, suggesting greater variability and less precision in their responses.
Example 2:
For the response variable “I am concerned about the safety and side effects of the vaccine”, the Non-Hispanic/Non-Latino group has a slightly higher mean (around 3.5) with a relatively wide confidence interval. The Hispanic/Latino group has a mean closer to 3, also with a wide confidence interval. Notably, the confidence interval for the Non-Hispanic/Non-Latino group is slightly wider than that of the Hispanic/Latino group. These results suggest that, on average, both groups are neutral in their concern about the vaccine’s safety and side effects. However, the wide confidence intervals indicate a fair amount of uncertainty around these means, making the estimates less precise. Though the overlap between intervals suggests no statistically significant difference between the two groups. The results do not surprise me; regardless of ethnicity, I would expect both groups to have similar opinions.
Example 3:
For the response variable “I trust the information that I have received about the vaccines”, the group that reported having received the COVID vaccine had a mean close to 1, with a narrow confidence interval. This indicates strong agreement with the statement and a high level of precision in the estimate. In contrast, the group that reported not having received the vaccine had a mean around 3, with a wide confidence interval. This suggests that those who were unvaccinated were generally neutral in their trust of vaccine information, and the wide confidence interval reflects greater uncertainty and less precision in this group’s response. The results make sense; I would expect unvaccinated individuals to be less confident in the information compared to those who have received the vaccine.
3 - COVID survey - reconstruct
# Read in data from COVID survey file and skip first rowcovid_survey <-read_csv(here("data", "covid-survey.csv"), skip =1)
Rows: 1121 Columns: 14
── Column specification ─────────────────────────────────────────
Delimiter: ","
dbl (14): response_id, exp_profession, exp_flu_vax, exp_gende...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print dimensions of data framedim(covid_survey)
[1] 1121 14
# Data cleanup - eliminate any rows where all values aside from response_id are missingcleaned_survey <- covid_survey |>filter(!if_all(-response_id, is.na))# Print dimensions of cleaned data framedim(cleaned_survey)
# Calculate the 10th percentile, mean, and 90th percentile of each of the response variables for each level of each explanatory variable.covid_survey_longer <- relabeled_survey |># Takes all the columns starting with "exp_" and creates them into two columns: explanatory and explanatory_value.pivot_longer(cols =starts_with("exp_"),names_to ="explanatory",values_to ="explanatory_value" ) |>filter(!is.na(explanatory_value)) |># Takes all the columns starting with "resp_" and creates them into two columns: response and response_value.pivot_longer(cols =starts_with("resp_"),names_to ="response",values_to ="response_value" )# Print dimensions of data frame and confirm tribble table matches homework covid_survey_longer
# Bind both summary tables togethercovid_survey_summary_stats <-bind_rows( covid_survey_summary_stats_all, covid_survey_summary_stats_by_group)# Check final dimensions dim(covid_survey_summary_stats)
[1] 132 6
# View final summarycovid_survey_summary_stats
# A tibble: 132 × 6
explanatory explanatory_value response mean low high
<chr> <fct> <chr> <dbl> <dbl> <dbl>
1 All "" resp_concern_… 3.28 1 5
2 All "" resp_confiden… 1.43 1 2
3 All "" resp_feel_saf… 1.36 1 2
4 All "" resp_safety 2.03 1 5
5 All "" resp_trust_in… 1.40 1 2
6 All "" resp_will_rec… 1.21 1 2
7 exp_age_bin "<20" resp_concern_… 3.35 2 4.4
8 exp_age_bin "<20" resp_confiden… 1.65 1 2.4
9 exp_age_bin "<20" resp_feel_saf… 1.71 1 3.8
10 exp_age_bin "<20" resp_safety 1.41 1 2
# ℹ 122 more rows
covid_survey_summary_stats <- covid_survey_summary_stats |>mutate(response =factor(response, levels =c("resp_safety", "resp_feel_safe_at_work", "resp_concern_safety","resp_confidence_science", "resp_trust_info", "resp_will_recommend" )),grouping =case_when( explanatory =="exp_already_vax"~"Had COVID\nvaccine", explanatory =="exp_flu_vax"~"Had flu\n vaccine this\nyear", explanatory_value ==""~"All", explanatory_value %in%c(">30", "26-30", "21-25", "<20") ~"Age", explanatory_value %in%c("Prefer not to say", "Non-binary/Third gender", "Male", "Female") ~"Gender", explanatory_value %in%c("White", "Native Hawaiian / Other\nPacific Islander", "Black / African American", "Asian", "American Indian /\nAlaskan Native") ~"Race", explanatory_value %in%c("Non-Hispanic / Non-Latino", "Hispanic / Latino") ~"Ethnicity", explanatory_value %in%c("Nursing", "Medical") ~"Profession",TRUE~NA_character_ ),grouping =factor(grouping, levels =c("All", "Age", "Gender", "Race", "Ethnicity", "Profession", "Had COVID\nvaccine", "Had flu\n vaccine this\nyear")),explanatory_value =factor(explanatory_value, levels =c("<20", "21-25", "26-30", ">30", "", "Female", "Male", "Non-binary/Third gender", "Prefer not to say", "American Indian /\nAlaskan Native", "Asian", "Black / African American", "Native Hawaiian / Other\nPacific Islander", "White", "Hispanic / Latino", "Non-Hispanic / Non-Latino", "Medical", "Nursing", "No", "Yes")), )# define labels for response variablesresponse_labels <-c("resp_safety"="Based on my\nunderstanding, I\nbelieve the vaccine\nis safe","resp_feel_safe_at_work"="Getting the vaccine\nwill make me feel\nsafer at work","resp_concern_safety"="I am concerned\nabout the safety\nand side effects of\nthe vaccine","resp_confidence_science"="I am confident in\nthe scientific\nvetting process for\n the new COVID\nvaccines","resp_trust_info"="I trust the\ninformation that I\nhave received about\nthe vaccines","resp_will_recommend"="I will recommend\nthe vaccine to\nfamily, friends,\nand community\nmembers")ggplot(covid_survey_summary_stats, aes(x = mean, y = explanatory_value)) +geom_point(size =0.6) +geom_errorbarh(aes(xmin = low, xmax = high), height =0.3, color ="black", size =0.3) +scale_y_discrete(expand =expansion(mult =c(0.05, 0.05))) +facet_grid( grouping ~ response, scales ="free_y", space ="free_y",labeller =labeller(response =as_labeller(response_labels)) ) +labs(title ="Mean Likert Scores by Group and Survey Question",x ="Mean Likert Score\nError bars range from 10th to 90th percentile",y =NULL ) +theme_minimal(base_size =11) +theme(panel.grid.major =element_blank(),panel.grid.minor =element_blank(),panel.spacing =unit(0.1, "lines"),strip.background =element_rect(fill ="grey90", color ="black"),axis.text.y =element_text(size =5),strip.text.y.right =element_text(size =4, angle = pi /6, margin =margin(t =80, r =1, b =80, l =1)),strip.text.x =element_text(margin =margin(t =5, r =0, b =5, l =0), size =6) )
Warning: Using `size` aesthetic for lines was deprecated in ggplot2
3.4.0.
ℹ Please use `linewidth` instead.
Sources
To not select a column, used ‘-’ directly from: https://stackoverflow.com/questions/49582602/how-not-to-select-columns-using-select-dplyr-when-you-have-character-vector-of
Directly referenced to remove NA data before evaluating expression: https://stat.ethz.ch/R-manual/R-devel/library/base/html/mean.html
Referenced for quantile function: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html
Referenced for ticks at end of bars: https://ggplot2.tidyverse.org/reference/geom_errorbarh.html
Referenced to modify the theme: https://ggplot2.tidyverse.org/reference/theme.html
Referenced to provide more space between rows: https://ggplot2.tidyverse.org/reference/scale_discrete.html
4 - COVID survey - re-reconstruct
# Group data in covid_survey_longercovid_survey_summary_stats_by_group_25_75 <- covid_survey_longer |>group_by(explanatory, explanatory_value, response) |>summarise(mean =mean(response_value, na.rm =TRUE),low =quantile(response_value, 0.25, na.rm =TRUE),high =quantile(response_value, 0.75, na.rm =TRUE),.groups ="drop" )# Group data in covid_survey_longer only by responsecovid_survey_summary_stats_all_25_75 <- covid_survey_longer |>group_by(response) |>summarise(mean =mean(response_value, na.rm =TRUE),low =quantile(response_value, 0.25, na.rm =TRUE),high =quantile(response_value, 0.75, na.rm =TRUE),.groups ="drop" ) |>mutate(explanatory ="All",explanatory_value =factor("") ) |>select(explanatory, explanatory_value, everything())# Bind both summary tables togethercovid_survey_summary_stats_25_75 <-bind_rows( covid_survey_summary_stats_all_25_75, covid_survey_summary_stats_by_group_25_75)covid_survey_summary_stats_25_75 <- covid_survey_summary_stats_25_75 |>mutate(response =factor(response, levels =c("resp_safety", "resp_feel_safe_at_work", "resp_concern_safety","resp_confidence_science", "resp_trust_info", "resp_will_recommend" )),grouping =case_when( explanatory =="exp_already_vax"~"Had COVID\nvaccine", explanatory =="exp_flu_vax"~"Had flu\n vaccine this\nyear", explanatory_value ==""~"All", explanatory_value %in%c(">30", "26-30", "21-25", "<20") ~"Age", explanatory_value %in%c("Prefer not to say", "Non-binary/Third gender", "Male", "Female") ~"Gender", explanatory_value %in%c("White", "Native Hawaiian / Other\nPacific Islander", "Black / African American", "Asian", "American Indian /\nAlaskan Native") ~"Race", explanatory_value %in%c("Non-Hispanic / Non-Latino", "Hispanic / Latino") ~"Ethnicity", explanatory_value %in%c("Nursing", "Medical") ~"Profession",TRUE~NA_character_ ),grouping =factor(grouping, levels =c("All", "Age", "Gender", "Race", "Ethnicity", "Profession", "Had COVID\nvaccine", "Had flu\n vaccine this\nyear")),explanatory_value =factor(explanatory_value, levels =c("<20", "21-25", "26-30", ">30", "", "Female", "Male", "Non-binary/Third gender", "Prefer not to say", "American Indian /\nAlaskan Native", "Asian", "Black / African American", "Native Hawaiian / Other\nPacific Islander", "White", "Hispanic / Latino", "Non-Hispanic / Non-Latino", "Medical", "Nursing", "No", "Yes")), )ggplot(covid_survey_summary_stats_25_75, aes(x = mean, y = explanatory_value)) +geom_point(size =0.6) +geom_errorbarh(aes(xmin = low, xmax = high), height =0.3, color ="black", size =0.3) +scale_y_discrete(expand =expansion(mult =c(0.05, 0.05))) +facet_grid( grouping ~ response, scales ="free_y", space ="free_y",labeller =labeller(response =as_labeller(response_labels)) ) +labs(title ="Mean Likert Scores by Group and Survey Question",x ="Mean Likert Score\nError bars range from 25th to 75th percentile",y =NULL ) +theme_minimal(base_size =11) +theme(panel.grid.major =element_blank(),panel.grid.minor =element_blank(),panel.spacing =unit(0.1, "lines"),strip.background =element_rect(fill ="grey90", color ="black"),axis.text.y =element_text(size =5),strip.text.y.right =element_text(size =4, angle = pi /6, margin =margin(t =80, r =1, b =80, l =1)),strip.text.x =element_text(margin =margin(t =5, r =0, b =5, l =0), size =6) )
The previous plot, which used the 10th to 90th percentiles, displayed the middle 80% of the data. This provided a view of most of the response distribution, observing a wide range of answers while excluding the most extreme values. In contrast, the 25th to 75th percentile range shows the middle 50% of the data. This range represents the most typical or frequently occurring values with less influence from outliers.
As a result, the plots differ. The plot using the 25th to 75th percentile generally displays narrower confidence intervals, making the data more precise. This suggests that while there’s variability in responses, the central half of responses is more consistent than the wider middle 80% range.
In both the 25th to 75th percentile plot and the 10th to 90th percentile plot, for the response “Getting the vaccine will make me feel safer at work,” nurses had an average response close to 1, with a relatively narrow confidence interval. This indicates strong agreement among nurses and a precise estimate of their average response. The confidence interval becomes even narrower in the 25th to 75th percentile plot, reflecting reduced influence from outliers.
Medical professionals, by comparison, had a mean response closer to 2. While their confidence interval also narrows in the 25th to 75th percentile plot, it remains wider than that of the nurses, indicating greater variability in their responses. This suggests that, although medical professionals tended to agree with the statement, they were on average slightly less certain than nurses.
Overall, while the means remain similar across both plots, the 25th to 75th percentile plot offers tighter confidence intervals, highlighting the middle 50% of responses with less influence from outliers and thereby improving precision.
5 - COVID survey - another view
(A)
# pivot responses to long formatlikert_long <- relabeled_survey |>pivot_longer(cols =starts_with("resp_"),names_to ="question",values_to ="response_value" ) |>filter(!is.na(response_value))# calculate percentage of each response value for each questionlikert_summary <- likert_long |>group_by(question, response_value) |>summarise(n =n(), .groups ="drop") |>group_by(question) |>mutate(percentage = n /sum(n) *100)# convert to signed percenatgeslikert_summary <- likert_summary |>mutate(perc_signed =case_when( response_value %in%c("1", "2") ~ percentage, response_value %in%c("3") ~0, response_value %in%c("4", "5") ~-percentage ) )# response variable is numeric and will need to be a factor for scale_fill_manual in the plotlikert_summary <- likert_summary |>mutate(response_value =factor(response_value, levels =c("1", "2", "3", "4", "5")))# define labels for response variablesresponse_labels <-c("resp_safety"="Based on my understanding,\nI believe the vaccine is safe","resp_feel_safe_at_work"="Getting the vaccine will\nmake me feel safer at work","resp_concern_safety"="I am concerned about the safety\nand side effects of the vaccine","resp_confidence_science"="I am confident in the scientific vetting\nprocess for the new COVID vaccines","resp_trust_info"="I trust the information that I have\nreceived about the vaccines","resp_will_recommend"="I will recommend the vaccine to family,\nfriends, and community members")ggplot(likert_summary, aes(x = perc_signed, y = question, fill = response_value)) +geom_col(position =position_stack(reverse =TRUE), width =0.6) +scale_fill_manual(values =c( "1"="#b2182b", "2"="#ef8a62","3"="#7aa381", "4"="#67a9cf", "5"="#2166ac"),name ="Likert Response",labels =c("1", "2", "3", "4", "5") ) +scale_y_discrete(labels = response_labels) +scale_x_continuous(labels = abs) +# Show positive labels on x-axislabs(title ="COVID Vaccine Survey Responses",x ="Percentage",y =NULL ) +theme_minimal(base_size =11) +theme(panel.grid.major.y =element_blank(),axis.text.y =element_text(size =8),legend.position ="bottom" )
The diverging bar chart displays the percentage of responses for each COVID vaccine survey question. The y-axis lists the survey questions, while the x-axis represents the percentage of likert responses. For each question, agree responses (‘1’ and ‘2’) extend to the right, neutral responses (‘3’) are centered at zero, and disagree responses (‘4’ and ‘5’) extend to the left. The length of each bar reflects the percentage of selecting that option. From the chart, it is clear that a large majority of people tended to agree with most of the survey statements. However, there is an exception in the question regarding concern about the safety and side effects of the vaccine with a majority of responses not agreeing.
(B)
# Pivot responses to long formatlikert_long <- relabeled_survey |>pivot_longer(cols =starts_with("resp_"),names_to ="question",values_to ="response_value" ) |>filter(!is.na(response_value))# Calculate percentage of each response value for each questionlikert_summary <- likert_long |>group_by(question, response_value) |>summarise(n =n(), .groups ="drop") |>group_by(question) |>mutate(percentage = n /sum(n) *100)likert_summary <- likert_summary |>mutate(question =factor(question, levels =c("resp_safety", "resp_feel_safe_at_work", "resp_concern_safety","resp_confidence_science", "resp_trust_info", "resp_will_recommend" )),response_value =factor(response_value, levels =c("1", "2", "3", "4", "5")) )# define labels for response variablesresponse_labels <-c("resp_safety"="Based on my understanding,\nI believe the vaccine is safe","resp_feel_safe_at_work"="Getting the vaccine will\nmake me feel safer at work","resp_concern_safety"="I am concerned about the safety\nand side effects of the vaccine","resp_confidence_science"="I am confident in the scientific vetting\nprocess for the new COVID vaccines","resp_trust_info"="I trust the information that I have\nreceived about the vaccines","resp_will_recommend"="I will recommend the vaccine to family,\nfriends, and community members")# Plot diverging bar chartggplot(likert_summary, aes(x = percentage, y = question, fill = response_value)) +geom_col(position =position_stack(reverse =TRUE), width =0.6) +scale_fill_manual(values =c( "1"="#b2182b", "2"="#ef8a62","3"="#7aa381", "4"="#67a9cf", "5"="#2166ac"),name ="Likert Response",labels =c("1", "2", "3", "4", "5") ) +scale_y_discrete(labels = response_labels) +labs(title ="COVID Vaccine Survey Responses",x ="Percentage",y =NULL ) +theme_minimal(base_size =11) +theme(panel.grid.major.y =element_blank(),axis.text.y =element_text(size =8),legend.position ="bottom" )
The 100% bar chart shows the percentage distribution of responses to COVID vaccine survey questions. The colored segments indicate response values from 1 (strongly agree) to 5 (strongly disagree). The chart reveals varying levels of agreement across questions, with generally higher agreement recommending the vaccine to others and lower agreement on vaccine safety concerns.
To compare the bars, it was visually easier to interpret the diverging bar chart. It clearly showed that the majority of people agreed with the survey questions. While the 100% stacked bar chart also indicated that most respondents agreed, it was much more difficult to compare the distribution of individual likert responses, especially those with smaller percentages. The diverging bar chart made it easier to visually assess both the overall trends and the relative size of less popular response categories.
Sources
Summarise each group into single row, directly used: https://dplyr.tidyverse.org/reference/summarise.html
To get info on the group reference: https://dplyr.tidyverse.org/reference/context.html
To select colors used: https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3
Referenced to stack bars on top of each other: https://ggplot2.tidyverse.org/reference/position_stack.html